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Abstract. Recent years have demonstrated that using random feature 
maps can significantly decrease the training and testing times of kernel- 
based algorithms without significantly lowering their accuracy. Regret¬ 
tably, because random features are target-agnostic, typically thousands 
of such features are necessary to achieve acceptable accuracies. In this 
work, we consider the problem of learning a small number of explicit poly¬ 
nomial features. Our approach, named Tensor Machines, finds a parsimo¬ 
nious set of features by optimizing over the hypothesis class introduced 
by Kar and Karnick for random feature maps in a target-specific manner. 
Exploiting a natural connection between polynomials and tensors, we 
provide bounds on the generalization error of Tensor Machines. Empir¬ 
ically, Tensor Machines behave favorably on several real-world datasets 
compared to other state-of-the-art techniques for learning polynomial 
features, and deliver significantly more parsimonious models. 


1 Introduction 

Kernel machines are one of the most popular and widely adopted paradigms 
in machine learning and data analysis. This success is due to the fact that an 
appropriately chosen non-linear kernel often succeeds in capturing non-linear 
structures inherent to the problem without forming the high-dimensional fea¬ 
tures necessary to explicitly delineate those structures. Unfortunately, the cost 
of kernel methods scales like 0{ii?), where n is the number of datapoints. Be¬ 
cause of this high computational cost, it is often the case that kernel methods 
cannot exploit all the information in large datasets; indeed, even the 0{n?) cost 
of forming and storing the kernel matrix can be prohibitive on large datasets. 

As a consequence, methods for approximately fitting non-linear kernel meth¬ 
ods have drawn much attention in recent years. In particular, explicit random 
feature maps have been proposed to make large-scale kernel machines practi¬ 
cal |14l7ll;ilfil22l2^ . The idea behind this method is to randomly choose an 
r-dimensional feature map cfi that satisfies fc(x, y) = E [^(x)^^(y)] , where k 
is the kernel of interest [T3]. Let ^ be the matrix whose rows comprise the 




application of </> to n datapoints, then the kernel matrix K has the low-rank 
approximation This approximation considerably reduces the costs of both 

fitting models, because one can work with the n-by-r matrix ^ instead of K, 
and of forming the input, because one need form only # instead of K. 

The success of this approximation hinges on choosing cj) so that 4>{x)'^4>{y) 
is close to fc(x, y) with high probability. One easy way to accomplish this is 
to take 0(x) = ..., (/)r(x)), where the random features (j)i all satisfy 

E [(^i(x)(/)i(y)] = /c(x,y). The concentration of measure phenomenon then en¬ 
sures that |fc(x, y) — ^(x)^^(y)| is small with high probability. In practice, one 
must use a large number of random features to achieve performance comparable 
with the exact kernel: r must be on the order of tens or even hundreds of thou¬ 
sands [5] . The reason for this is simple: since knowledge of the target function is 
not used in generating the feature map 4>, to ensure good performance enough 
random features must be selected to get good approximations to arbitrary func¬ 
tions in the reproducing kernel Hilbert space associated with k. Thus under the 
random feature map paradigm, one trades off a massive reduction in the cost of 
optimization for the necessity of generating a large number of random features, 
each of which is rather uninformative about any given target. 

Implicit in the formulation of random feature maps is the assumption that 
the chosen class of features is expressive enough to cover the function space 
of interest, yet simple enough that features can be sampled randomly in an 
efficient manner. Given this assumption, it seems natural to question whether it 
is possible to directly select a much smaller number of features relevant to the 
given target in a computational efficient manner. 


Target-specific optimization vs target-agnostic randomization 

Given a kernel k and a parametrized family of features {4>uj}uj^n satis¬ 
fying fc(x, y) = E [(/)tj(x)^(/)tj(y)] where the expectation is with respect to 
some distribution on 17, can optimization over this hypothesis class to select 
r features give a parsimonious representation of a known target, with the 
same or less computational effort and less error than target-agnostic random 
sampling of w? 


This question has recently been answered in the affirmative for the case of the 
radial basis kernel |24I25) . In this work, we provide a positive answer to the above 
question in the case of polynomial kernels and the Kar-Karnick random features 
introduced in [7] (see Figured]). Polynomial kernels are of interest because, after 
the radial basis kernel, they are arguably the most widely used kernels in ma¬ 
chine learning applications. The Weierstrass approximation theorem guarantees 
than any smooth function can be arbitrarily accurately approximated by a poly¬ 
nomial [17] [Chapter 7], so in principle a polynomial kernel of sufficiently high 
degree can be used to accurately approximate any smooth target function. 

In this work, we use a natural connection between tensors and polynomials 
to introduce a restricted hypothesis class of polynomials corresponding to low- 
rank tensors. In the Tensor Machine paradigm, learning consists of learning 
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Fig. 1. The relative test error and relative running time (relative to kernel ridge 
regression on a subset of the training data) of several polynomial learning algorithms 
over 10 datasets. The median and first and third quartiles of the error and running 
times are shown. TM-Batch and TM-SFO are solvers for the Tensor Machine model 
class introduced in this work. See Section [S] for further discussion. 
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a regularized low-rank decomposition of a hidden tensor corresponding to the 
target polynomial. Given n training points in d-dimensional input space, the 
computational cost of fitting a degree q TM with a rank-rxM approximation 
for each degree (using a first-order algorithm) is ©(ndg^rxM), while the cost 
of fitting a predictor using the Kar-Karnick random feature maps and kernel 
regression approach of m is O^ndgr^^K+nr^^). In practice the xkk required for 
accurate predictions is typically on the order of thousands; by way of comparison, 
we show empirically that TMs typically require less than txm = 5 to achieve a 
comparable approximation error! 

We show experimentally that Tensor Machines strike a favorable balance be¬ 
tween expressivity and parsimony, and provide an analysis of Tensor Machines 
that establishes favorable generalization properties. The objective function for 
fitting Tensor Machines is nonconvex, but we show empirically that it is suf¬ 
ficiently structured that the process of fitting a TM is robust, efficient, and 
requires very few features. We demonstrate that our algorithm exhibits a more 
favorable time-accuracy tradeoff when compared to the random feature map 
approach to polynomial regression mm, as well as to the polynomial network 
algorithm of [3], and the recently introduced Apple algorithm for online sparse 
polynomial regression [T]. 

2 Prior Work 

Consider the estimation problem of fitting a polynomial function f to n i.i.d. 
training points (xi,yi) drawn from the same unknown distribution, formulated 




















as 

1 

/ = arg miii^g„_^-^£(/(x,),yz) + A||/||h,, (1) 

^ • 1 
1—1 

where Hq is the reproducing kernel Hilbert space of all polynomials of degree 
at most q and ^ is a specified loss function. An exact solution (to within a 
specified numerical precision) to this problem can be obtained in 0{n^) time 
using classical kernel methods. 

One approach in the literature has been to couple kernel methods with various 
techniques for approximating the kernel matrix. The underlying assumption is 
that the kernel matrix is numerically low-rank, with rank r <C n; typically these 
methods reduce the cost of kernel methods from 0{n^) to Oinr^). Nystrom 
approximations, sparse greedy approximations, and incomplete Cholesky factor¬ 
izations fall into this class EnnE]. In [2, Bach and Jordan observed that prior 
algorithms in this class did not exploit all the knowledge inherent in supervised 
learning tasks: namely, they did not exploit knowledge of the targets (classifica¬ 
tion labels or regression values). They showed that by exploiting knowledge of 
the target, one can construct low-rank approximations to the kernel matrix that 
have significantly smaller rank, with a computational cost that remains 0{nr^). 

Another approach to polynomial-based supervised learning relies upon the 
modeling assumption that the desired target function can be approximated well 
in the subspace of T-Lq consisting of sparse polynomials. Recall that a sparse 
degree-g polynomial in d variables is one in which only a few of the possible 
monomials have nonzero coefficients (there are exponentially in d many such 
monomials). The early work of Sanger et al. attempts to learn the monomials 
relevant to the target in an online manner by augmenting the current polyno¬ 
mial with interaction features |18) . The recent Apple algorithm of Agarwal et 
al. attempts to learn a sparse polynomial, also in an online manner, using a dif¬ 
ferent heuristic that selects monomials from the current polynomial to be used 
in forming the next term of the monomial [1]. The algorithms presented in [2] 
and [5] provide theoretical guarantees for fitting sparse polynomials, but their 
computational costs scale undesirably for large-scale learning. 

Polynomial fitting has also been tackled using the neural network paradigm. 
In [9], Livni et al. provide an algorithm for learning polynomial functions as deep 
neural networks which has the property that the training error is guaranteed to 
decrease at each iteration. This algorithm has the desirable properties that the 
network learns a linear combination of a set of polynomials that are constructed 
in a target-dependent way, and that the degree of the polynomial does not have 
to be specified in advance: instead, additional layers can be added to the network 
until the desired error threshold has been reached, with each layer increasing the 
degree of the predictor by one. Unfortunately, this algorithm requires careful 
tuning of the hyperparameters (number of layers, and the width of each layer). 
In the subsequent work [in], Livni et al. provide an algorithm for fitting cubic 
polynomials in an iterative manner using rank-one tensor approximations. It 
can be shown that, in fact, this algorithm greedily fits cubic polynomials in the 
Tensor Machine class we propose in this paper. 




Factorization Machines combine the expressivity of polynomial regression 
with the ability of factorization models to infer relationships from sparse train¬ 
ing data m- Quadratic Factorization Machines, as introduced by Rendle, are 
models of the form 

d d 

/(x) = Wo + -t-^ ^ {vi,Vj)xiXj, 

i—1 

where a vector is learned for each coordinate of the input x. These models 
have been applied with great success in recommendation systems and related 
applications with sparse input x. Quadratic FMs can be fit in time linear in the 
size of the data, the degree of the polynomial being fit, and the length of the 
vectors Vj, but can only represent nonlinear interactions that can be written as 
symmetric homogeneous polynomials (plus a constant). In particular, FMs can¬ 
not represent polynomials that involve monomials containing variables raised 
to powers higher than one. For instance, a:| cannot be represented as a Factor¬ 
ization Machine. Another drawback to Factorization Machines is that explicitly 
evaluating the sums involved for a degree-g FM requires 0{(P) operations. A 
computational manipulation allows quadratic FMs to be fit in linear time, and 
it has been claimed that similar manipulations for higher-order FMS 1|15)1. but 
no such generalizations have been documented. Perhaps for this reason, only 
second-order FMs have been used in the literature. 

The random feature maps approach to polynomial-based learning m ex¬ 
ploits the concentration of measure phenomenon to directly construct a random 
low-dimensional feature map that approximately decomposes K. Kar and Kar- 
nick provided the first random feature map approach to polynomial regression 
in [7] ; their key observation is that the degree-g homogeneous polynomial kernel 
fc(x, y) = (x, y)'J can be approximated with random features of the form 

t=l 

where the vectors w* are vectors of random signs; that is, random feature maps 
of the form ^ 

cf) = ..., (pr) 

yV 

satisfy the condition fc(x, y) = E [(/)(x)^0(y)] necessary for the random feature 
map approach. Pagh and Pham provided a qualitatively different random feature 
map for polynomial kernel machines, based on the tensorization of fast hashing 
transforms m . These TensorSketch feature maps often outperform Kar-Karnick 
feature maps, but both methods require a large r. The CRAFTMaps approach 
of combines either TensorSketch or Kar-Karnick random feature maps with 
fast Johnson-Lindenstrauss transforms to significantly reduce the number of 
features needed for good performance. 

Of these methods, the Tensor Machine approach introduced in this paper 
is most similar to the Factorization Machine and Kar-Karnick random feature 
map approaches to polynomial learning. 


3 Tensors and polynomials 


To motivate Tensor Machines, which are introduced in the next section, we 
first review the connection between polynomials and tensors. For simplicity we 
consider only homogeneous degree-g polynomials: the monomials of such a poly¬ 
nomial all have degree q. We show that Kar-Karnick predictors and Factorizaton 
Machines correspond to specific tensor decompositions. 

Recall that a tensor T is a multidimensional array. 


The number of indices into the array, g, is also called the degree of the tensor. 
The inner product of two conformal tensors T and S is obtained by treating 
them as two vectors: 
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The Segre outer product of vectors uii G R'*!,..., w, G is the dixd 2 X - ■ -xdq 
tensor that satishes 


The degree-g self-outer product of a vector uj is denoted by 
Given a decomposition for a tensor T of the form 


i=l 

where r is minimal, r is called the rank of the tensor. A tensor T is supersym¬ 
metric if there exists a decomposition of the form 


The tensor comprises all the degree-g monomials in the variable x, so any 
degree-g homogeneous polynomial / satisfies /(x) = (T,x('J)) for some degree-g 
tensor T; likewise, any degree-g tensor T determines a homogenous polynomial / 
of degree g. This equivalence between homogeneous polynomials and tensors al¬ 
lows us the attack the problem of polynomial learning as one of learning a tensor. 
The Kar-Karnick random feature maps and Factorization Machine approaches 
can both be viewed through this lens. 


A single Kar-Karnick random polynomial feature corresponds to a rank-one 
tensor: 


q q / d 

j=i j=i \i=i 
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Accordingly, predictors generated by the random feature maps approach with 
the Kar-Karnick feature map (p satisfy 

r j r 

/(x) = X = ( X *^2* 

i=l \i=l 

for some coefficient vector a. That is, Kar-Karnick predictors correspond to 
tensors in the span of r randomly sampled degree-g rank-one tensors. 
Factorization Machines fit predictors of the form 

/(x) = ^(v,Vi 

where the factors Vi are vectors in R™ and 

m 

(vzi,...v,J = ■ ■ (v*Jj 

is the generalization of the inner product of two vectors. Let V be the d x m 
matrix with rows comprising the vectors Vi, and define the d-dimensional vector 
u)j to be the jth column of V. It follows that FMs can be expressed in the form 
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where the tensor R is defined so that its nonzero entries cancel the entries of 
that correspond to coefiicients of monomials like X 1 X 2 that contain a 
variable raised to a power larger than one. The vectors ujj are learned during 
training (the tensor R is implicitly defined by these vectors). 


4 Tensor Machines 

Thus, the Kar-Karnick random feature map approach searches for a low-rank 
tensor corresponding to the target polynomial and the Factorization Machine 
approach attempts to provide a supersymmetric low-rank tensor plus sparse 
tensor decomposition of the target polynomial. The Kar-Karnick approach has 
the drawback that it tries to find a good approximation in the span of a set 
of random rank-one tensors, so many such basis elements must be chosen to 
ensure that an unknown target can be represented well. Factorization Machines 
circumvent this issue by directly learning the tensor decomposition, but impose 
strict constraints on the form of the fitted polynomial that are not appropriate 
for general learning applications. 

As an alternative to Kar-Karnick random feature maps and factorization 
machines, we propose to instead fit Tensor Machines by directly optimizing over 
rank-one tensors to directly form a low-rank approximation to the target tensor 
(i.e., the target polynomial). Since the targets are not, in general, homogenous 
polynomials, we learn a different set of rank-one factors for each degree up to q. 
More precisely. Tensor Machines are functions of the form 

/(x)=a ;0 + (a;i,x) + ^/^<% 

p^2 \i^l 

obtained as minimizers to the following proxy to the kernel machine objective O: 

n q r p 

- ^ ^(/(x.), 2/0 + A||a;i^ + A E E E Il‘^ril2- ( 2 ) 

i=l p=2 i=l j=l 

By construction. Tensor Machines couple the expressiveness of the Kar-Karnick 
random feature maps model with the parsimony of Factorization Machines. 
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5 Generalization Error 

In this section we argue that fitting Tensor Machines using empirical risk min¬ 
imization makes efficient use of the training data. We show that the observed 
risk of Tensor Machines converges to the expected risk at the optimal rate, thus 
indicating that empirical risk minimization is an efficient method for finding lo¬ 
cally optimal Tensor Machines (assuming the optimization method avoids saddle 
points). For convenience, we consider a variant of Tensor Machines where the 


are constrained: 


norms of the vectors 


minimize - ^ + (w \ x,;) + ^ (Tp , ^ 


minimize 



r 


subject to Tp = ^ • • • • • ujP'^ for p>2, 


||‘^f ’'’||2 < B for allp,j, i. 


This formulation is not equivalent to the formulation given in ([2]) , but by taking 
B to be the norm of the largest constituent vector in the fitted Tensor Ma¬ 
chine, any bound on the generalization error of this formulation applies to the 
generalization error of Tensor Machines obtained by minimizing ©. 

Recall that the Rademacher TZn (B) complexity of a function class J" measures 
how well random noise can be approximated using a function from B [3]. 

Definition 1 (Rademacher complexity). Given a function class F and i.i.d. 
random variables the Rademacher complexity of J-, TZn^B), is defined as 



n 


where the Ui are independent Rademacher random variables (random variables 


taking values ±1 with equal probability). 

Well-known results (e.g. [3] [Theorems 7 and 8]) state that, with high prob¬ 
ability over the training data, the observed classification and regression risks 
are within 0{TZn{iF) -I- Ij^/n) of the true classification and regression risks. In 
fact, the optimal rate of convergence for the observed risk to the expected risk is 
0{l/y/n), which can only be achieved when the Rademacher complexity of the 
hypothesis class is 0{l/^/n) [11]. 

Our main observation is that the Rademacher complexity of a Tensor Ma¬ 
chines grows at the rate (^(l/i/n), so the empirically observed estimate of the 
Tensor Machine risk converges to the expected risk at the optimal rate. 

Theorem 1. Let iFd,q,r,B denote the class of degree-q, rank-r Tensor Machines 
on with constituent vectors constrained to lie in the ball of radius B: 



p=2 


r 






Let (x,y) be distributed aecording to P, and assume ||x ||2 < almost surely. 
The Rademacher complexity, with respect to (x, y), of this hypothesis class sat¬ 
isfies 


d,q,r,B') ^ 


cr(l + 8B q'^ {ytffdAnd + \/d) 

y/n 


where c is a constant. 


This result follows from reformulating the Rademacher complexity as the 
spectral norm of a Rademacher sum of data-dependent tensors and then applying 
a recent bound from |12j on the spectral norm of random tensors. A full proof 
is provided in the appendix. 


6 Empirical Evaluations 

In this section, we evaluate the performance of Tensor Machines on several real- 
world regression and classification datasets and demonstrate their attractiveness 
in learning polynomial features relative to other state-of-the-art algorithms. 

6.1 Experimental setup 

The nonconvex optimization problem ([5|) is central to fitting Tensor Machines. 

We consider two solvers for ©• The first uses the implementation of L-BFGS 
provided in Mark Schmidt’s minFuncH MATLAB package for optimization m- 
Since L-BFGS is a batch optimization algorithm, we also investigate the use of 
SFO, a stochastic quasi-Newton solver designed to work with minibatches [20] . 

We use the reference implementation of SFO provided by Sohl-Dickstein 0. We 
refer to the two algorithms used to fit Tensor Machines as TM-Batch and TM- 
SFO, respectively. 

The choice of initialization strongly influences the performance of both TM- 
Batch and TM-SFO. Accordingly, we used initial points sampled from a A/’(0, a^) 
distribution. The variance as well as the regularization parameter A in (|2|) 
are set by tuning. 

We compared TM-Batch and TM-SFO against several recent algorithms for 
learning polynomials: CRAFTMaps |5], Basis Learner and Apple [Tj. To pro¬ 
vide a baseline, we also used Kernel Ridge Regression (KRR) to learn polyno¬ 
mials; in cases where the training set contained more than 40,000 points, we 
randomly chose a subset of size 40,000 to perform KRR. For CRAFTMaps, the 
up-projection dimensionality is set to be 4 times the down-projection dimension¬ 
ality. We did not obtain reasonable predictions using the original implementation 
of Apple in vowpal wabbit, so we implemented the model in MATLAB and used 
this implementation to train and test the model, but reported the running time 
of the VW implementation (with the same choice of parameters). The remaining 

® The 2012 release, retrieved from http://www.cs.ubc. ca/~schinidtm/Software/minFunc.html 
https : //github. com/Sohl-Dickstein/Sum-of-Functions-Dptimizer/blob/master/README .md 





Name 

Train/Test split 

d 

Type 

q 

Indoor 

19937/11114 

520 

regression 

3 

Year 

463715/51630 

90 

regression 

2 

Census 

18186/2273 

119 

regression 

2 

Slice 

42291/10626 

384 

regression 

5 

Buzz 

466600/116650 

77 

regression 

3 

Gisette 

6000/1000 

5000 

binary 

3 

Adult 

32561/16281 

122 

binary 

3 

Forest 

522910/58102 

54 

binary 

4 

eBay search 

500000/100000 

478 

binary 

3 

Cor-rna 

59535/157413 

8 

binary 

4 


Table 1. Description of datasets. The target degree q was selected by minimizing the 
KRR test error on each dataset (or a subset). 


methods are implemented in pure MATLAB. The experiments were conducted 
on a machine with four 6-core Intel Xeon 2 GHz processors and 128 GB RAM. 

Because the performance of nonlinear learning algorithms can be very data- 
dependent, we tested the methods on a collection of 10 publicly available datasets 
to provide a broad characterization of the behavior of these algorithms. A sum¬ 
mary of the basic properties of these datasets can be found in Table [T] For each 
dataset, the target degree r was chosen to be the value that minimized the KRR 
test error. We proprocessed the data by first normalizing the input features to 
have similar magnitudes — viz., so that each column of the training matrix 
has unit Euclidean norm — then scaling each datapoint to have unit Euclidean 
norm. 

For regression tasks, we used the squared loss, and for binary classification 
tasks we used the logistic loss. For regression tasks, the test error is reported as 
||y~ 2 /*|| 2 /||y *||2 where y and y* are the prediction and ground truth, respectively; 
inaccuracy is reported for classification tasks. 

Each algorithm is governed by several interacting parameters, but for each 
algorithm we identify one major parameter: the number of iterations for TM- 
Batch, the number of epochs for TM-SFO, the number of features for GRAFTMaps, 
the layer width for Basis Learner, and the expansion parameter for Apple. We 
choose optimal values for the non-major parameters through a 10-fold cross- 
validation on the training set. We varied the major parameter over a wide range 
and recorded the test error and running times for the different values until the 
performance of the algorithms saturated. By saturation, we mean that the im¬ 
provement in the test error is less than 0.02 x err(krr) where err(krr) denotes 
the test error of KRR. For each combination of parameters, the average test 
errors and running times over 3 independent trials are reported. 








6.2 Overall performance 


To compare their performance, we applied TM-Batch, TM-SFO, CRAFTMaps, 
Basis Learner, and Apple to the 10 datasets listed in Table[TJ We did not evaluate 
Factorization Machines because efficient algorithms for fitting FM models are 
only available in the literature for the case g = 2. It is unclear whether FMs 
with higher values of q can be fit efficiently. 

We present the computation/prediction accuracy tradeoffs of the considered 
algorithms in Figure [T] The data plotted are the test errors and running times 
of the algorithms relative to those of KRR: 

relerr = (err(alg) — err(krr))/err(krr) 


and 

reltime = time(alg)/time(krr). 

The median values of relerr and reltime as well as the corresponding first and 
third quartiles are shown. 

The detailed results can be found in TablelH The performance of TM-Batch is 
consistent across the datasets in the sense that it provides reliable predictions in a 
fairly short amount of time. TM-SFO converges to lower quality solutions than 
TM-Batch, but has lower median and variance in runtime. CRAFTMaps and 
Basis Learner take significantly more time to yield solutions almost as accurate 
as the two Tensor Machine algorithms. As one might expect, due to its greedy 
nature, Apple is the fastest of the algorithms (the time of the vowpal wabbit 
implementation is reported), but of all the algorithms, TM-Batch and TM-SFO 
deliver the lowest worst-case relative errors. 


Name 

KRR 

TM-Batch 

TM-SFO 

CRAFTMaps 

Basis Learner 

Apple 

Indoor 
Year 
Census 
Slice 
Buzz 
Gisette 
Adult 
Forest 
eBay search 
Cor-rna 

0.00961/50 

0.00485/310 

0.0667/46 

0.0118/441 

0.407/624 

0.027/6.6 

0.150/182 

0.148/361 

0.197/446 

0.0446/423 

0.00802/3.2(4) 

0.00565/85(5) 

0.0774/12(5) 

0.0229/80(3) 

0.407/568(2) 

0.0266/16(4) 

0.149/0.9(4) 

0.184/657(5) 

0.197/612(5) 

0.0453/8.4(3) 

0.0182/3.9(4) 
0.00484/74(5) 
0.0802/45(5) 
0.02 1 3/73(3) 
0.409/440(2) 
0.0254/29(4) 
0.151/2.0(4) 
0.178/569(5) 
0.192/349(5) 
0.0489/7.2(3) 

0.00559/2.7(300) 

0.00494/39(200) 

0.0767/7.3(1000) 

0.0465/707(7000) 

0.408/2105(1200) 

0.0300/63(11500) 

0.154/17(700) 

0.196/1023(750) 

0.281/1054(800) 

0.0462/19(200) 

0.00241/97 

0.00496/67 
0.0788/8.1 
0.0501/1034 
0.373/2472 
0.0270/62 
0.149/ 30 
0.180/3494 
0.281/1642 
0.0493/1.7 

0.00712/6.5 

0.00489/60 

0.0950/7.3 

0.0788/46 

0.496/325 

0.0240/32 

0.150/0.8 

0.219/14 

0.269/122 

0.0489/4.2 


Table 2. Test error/running time of each methods on 10 datasets. The rank parameter 
r used in TM-Batch and TM-SFO and number of features used in CRAFTMaps is 
listed in the parenthesis. For Apple, the test error is computed based on a MATLAB 
implementation of Apple but the running time is recorded by using the vowpal wabbit 
framework with the same choice of parameters. 


Both TM solvers required less than r = 5 rank-one TM features for each 
individual degree fit for all datasets. Since only one dataset benefited from a 
quintic fit, the TM models required at most 1 + d + parameters 








to fit each dataset; this should be compared to the CRAFTMaps models, which 
required at least 400d < qrd parameters to fit each dataset, generally produced 
fits with higher error than the TM solvers, and required longer solution times. 

6.3 Effect of rank parameter r 

To investigate the effect of the rank parameter r on the test error of Tensor 
Machines, we evaluated both TM-Batch and TM-SFO on the Census and Slice 
datasets. On Census, since the target degree is 2, we evaluate Factorization 
Machines (FM) as well. The results are shown in Figure [21 

As expected, increasing r leads to smaller test errors. On Slice, where the 
target degree q = 5, the gap between the performance of TMs and KRR is 
relatively large (almost a factor of 2), while on Census, where q = 2, the gap 
is slighter.Interestingly, on Census, increasing rank does not lead to a higher 
accuracy after r > 3. We also observe on Census that the error of TMs is lower 
than that of FMs; this behaviour was also observed on the other datasets when 
q = 2. 
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Fig. 2. Test error of TM-Batch and TM-SFO with different rank parameters r on the 
Census (g = 2) and Slice (g = 5) datasets. The test error of using kernel ridge regression 
(KRR) is also plotted. On Census, we also evaluate Factorization Machines (FM). 


6.4 Time-accuracy tradeoffs 

To assess the time-accuracy tradeoffs of the various algorithms, in Figure [3] we 
plot the test error vs the training time for the Slice and Forest datasets. The 
training time is determined by varying the settings of each algorithm’s major 
parameter. 

TM-Batch and TM-SFO compare favorably against the other methods: they 
either produce a much lower error than the other methods (on Slice) or reach a 
considerably low error much faster (on Forest). Similar patterns were observed 
on most of the datasets we considered. 
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Fig. 3. Evaluation of time-accuracy tradeoffs of the algorithms on the Slice {q = 5) 
and Forest {q = 4) datasets. The rank parameters r for TM-Batch and TM-SFO on 
Slice and Forest are 3 and 5, respectively. 


6.5 Scalability of TM-SFO on eBay search dataset 

Since SFO needs to access only a small mini-batch of data points per update, it 
is suitable for fitting TMs on datasets which cannot fit in memory all at once. 
Here we explore the scalability of TM-SFO on a private eBay search dataset of 
1.2 million data points with dimension 478. Each data point {ua,ui,) comprises 
the feature vectors associated with a pair of items that were returned as the 
results of a search on the eBay website and were subsequently visited. The goal 
is to fit a polynomial for the task of classifying which item was clicked on first: 
a or b. In our experiments, we randomly selected 100,000 data points to be the 
test set; training sets of variable size were selected from the remainder. 

For comparison, we also evaluate CRAFTMaps on the same task. In TM- 
SFO, we fix the rank parameter r to be 5 and number of epochs to be 50. In 
CRAFTMaps, we fix the number of random features to be 800. These parameters 
are the optimal settings from the experiments used to generate Table O We 
report both the classification error and training time as the size of training set 
grows. 

The results are shown in Figure 01 We observe that the running time of 
TM-SFO grows almost linearly with the size of the training set, while that 
of CRAFTMaps grows superlinearly. Also, as expected, becanse CRAFTMaps 
choose the hypothesis space independently of the target, increasing the size of 
the training set without also increasing the size of the model does not give 
much gain in performance past 100,000 training points. We see that the target- 
adaptive nature of TMs endows TM-SFO with two advantages. First, for a fixed 
amount of training data the errors of TMs are significantly lower than those of 
CRAFTMaps. Second, because the hypothesis space evolves as a function of the 
target and training data, the training error of TM-SFO exhibits noticeable decay 
up to a training set size of about 500,000 points, long past the point where the 
CRAFTMaps test error has saturated. 
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Fig. 4. Scalability of TM-SFO on eBay search dataset (g = 3, r = 5). The number of 
epochs is set to be 50. The plot shows the test error (black solid line) and training time 
(red dash line) as a function of the number of training point used. 
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A Proof of Theorem [T] 


Theorem [T] is a corollary of the following bound on the Rademacher complexity 
of a rank-one Tensor Machine. 

Theorem 2. Let fFd,q,B denote the elass of degree-q rank-one Tensor Machines 
on with constituent vectors constrained to lie in the ball of radius B: 


^d,q,B = |/ : X 




) I W^jh <Bforj = l,...,g.| 


Let (x,?/) he distributed according to P, and assume ||x ||2 < B^ almost surely. 
The Rademacher complexity with respect to (x, y) of this hypothesis class satisfies 


B,n{J^d,q,B) < 


c(8BBx)‘^q{y/qd In d -1- \/d) 




where c is a constant. 


Proof (Proof of Theorem Qp. Every function in J-d,q,r,B can be written as the 
sum of a constant, a linear function of the form x i—^ with ||a;|| < i?, 

and r functions from each of Pd. 2 ,BT ■ ■ The Rademacher complexity 

of K is zero, and a straightforward calculation establishes that the Rademacher 
complexity of linear functions of the specified form is no larger than BB^/y/n. It 
follows from a simple structural result on Rademacher complexities [1] [Theorem 
12] that 
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Applying Theorem [2l we have that 
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Proof (Proof of Theorem\^. Given the data points xi,..., x„, let 
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so that TZn(P) = From the definition of Pd,q,B, we have that 
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where Qb = {u; £ | ||w ||2 < B}. 

Define Tg. = Recall the definition of the spectral norm of an 

order-g tensor T : 


|T||= sup |(T,Ui 
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It follows from this definition that 
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For simplicity, we drop the subscript tr from in the following. Note that 
E [T] = 0, since Rademacher variables are mean-zero. We now apply Theorem 2 
of [H] , which bounds the expected spectral norm of the difference between a ten¬ 
sor Rademacher sum and its expectation with the sum of the expected maximum 
Euclidean lengths of one-dimensional slices through the tensor: 


E, [||T||] =E, [||T-ET||] 


< q\a.d -\- 1) 




We replace the innermost sum with a maximum to obtain an estimate involving 
the expected size of the largest entry in T : 
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Next we bound the maximum entry of T with the Frobenius norm of T, and 
apply Jensen’s inequality to obtain 
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Recalling the definition of T, we have that 

2 


Ea 


max 


<E' (E” 

d n 

11.. ...1q — l i,j—l 

d n n 

= i: 

11.. ...1q — l i—1 2 = 1 


It is readily established that 
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for any rank-one tensor, so it follows that 
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From equations (O, (0), and (0), we conclude that 


TZn{J^) = lExi,.. 


.,XjT, 




< 


c{8BBx)‘^q{y^qd\nd + Vd) 
y/n 






